58 research outputs found

Disambiguating the species of biomedical named entities using natural language parsers

Author: Ananiadou Sophia
Tsujii Jun'ichi
Wang Xinglong
Publication venue: Oxford University Press
Publication date: 01/01/2010

Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with a focus on methods utilizing natural language parsers.

Crossref

PubMed Central

The University of Manchester - Institutional Repository

Using a Random Forest Classifier to Compile Bilingual Dictionaries of Technical Terms from Comparable Corpora

Author: Ananiadou Sophia
Kontonatsios Georgios
Korkontzelos Yannis
Tsujii Jun'ichi
Publication date: 01/04/2014

Edge Hill University Research Information Repository

Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora

Author: Ananiadou Sophia
Kontonatsios Georgios
Korkontzelos Ioannis
Tsujii Jun'ichi
Publication date: 01/01/2014

Crossref

Edge Hill University Research Information Repository

The University of Manchester - Institutional Repository

Evaluating contributions of natural language parsers to protein–protein interaction extraction

Author: Matsuzaki Takuya
Miyao Yusuke
Sagae Kenji
Sætre Rune
Tsujii Jun'ichi
Publication venue: Oxford University Press
Publication date: 01/02/2009

Motivation: While text mining technologies for biomedical research have gained popularity as a way to take advantage of the explosive growth of information in text form in biomedical papers, selecting appropriate natural language processing (NLP) tools is still difficult for researchers who are not familiar with recent advances in NLP. This article provides a comparative evaluation of several state-of-the-art natural language parsers, focusing on the task of extracting protein–protein interactions (PPIs) from biomedical papers. We measure how each parser, and its output representation, contributes to accuracy improvement when the parser is used as a component in a PPI system.

PubMed Central

eScholarship - University of California

New challenges for text mining: mapping between text and manually curated pathways

Author: Kim Jin-Dong
Matsuzaki Takuya
Oda Kanae
Ohta Tomoko
Okanohara Daisuke
Tateisi Yuka
Tsujii Jun'ichi
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge. Results: To address these challenges, we constructed new resources to link the text with a model pathway; they are: the GENIA pathway corpus with event annotation and NF-kB pathway. Through their detailed analysis, we address the untapped resource, ‘bio-inference,’ as well as the differences between text and pathway representation. Here, we show the precise comparisons of their representations and the nine classes of ‘bio-inference’ schemes observed in the pathway corpus. Conclusions: We believe that the creation of such rich resources and their detailed analysis is the significant first step for accelerating the research of the automatic construction of pathway from text.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Text mining meets workflow: linking U-Compare with Taverna

Author: Ananiadou
Ferrucci
Hull
Jun'ichi Tsujii
Kano
Krallinger
Mio Nakanishi
Miwa
Paul Dobson
Settles
Sophia Ananiadou
Yoshinobu Kano
Publication venue: Oxford University Press

Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare to Taverna, a generic workflow system, to expose text mining functionality to the bioinformatics community.

Crossref

PubMed Central

Themes in biomedical natural language processing: BioNLP08

Author: A Airola
A Neveol
A Roberts
Bonnie Webber
Dina Demner-Fushman
H Kilicoglu
John Pestian
Jun'ichi Tsujii
K Bretonnel Cohen
K Verspoor
M Stevenson
P Corbett
Sophia Ananiadou
V Vincze
X Wang
Y Sasaki
Y Tsuruoka
Publication venue: BioMed Central
Publication date: 01/01/2008

Crossref

Springer - Publisher Connector

PubMed Central

Edinburgh Research Explorer

Accelerating the annotation of sparse named entities by dynamic sentence selection

Author: A Culotta
A Globerson
A Vlachos
AA Morgan
B Settles
CA Thompson
D Okanohara
D Shen
EF Tjong Kim Sang
I Dagan
J Lafferty
J Nocedal
JD Kim
Jun'ichi Tsujii
K Tomanek
L Tanabe
LR Rabiner
S Engelson
S Kulick
S Sarawagi
Sophia Ananiadou
Yoshimasa Tsuruoka
Publication venue: BioMed Central
Publication date: 01/01/2008

Crossref

Springer - Publisher Connector

PubMed Central

The University of Manchester - Institutional Repository

Investigating heterogeneous protein annotations toward cross-corpora utilization

Author: A Arnold
A Yeh
AM Cohen
B Alex
B Efron
C Nédellec
CJ Kuo
EFTK Sang
EW Noreen
F Rinaldi
F Sha
G Zhou
H Daumé III
H Shatkay
HL Johnson
J Wilbur
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
K Franzén
K Yoshida
KB Cohen
L Gillick
L Tanabe
MA Mandel
R Bunescu
R Kabiljo
RTH Tsai
Rune Sætre
S Pyysalo
Sampo Pyysalo
T Ohta
V Hatzivassiloglou
X Sun
Y Song
Y Wang
Yue Wang
Publication venue: BioMed Central
Publication date: 01/12/2009

Background: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011

Author: A Casillas
A Vlachos
Akinori Yonezawa
C Quirk
D McClosky
D Tuggener
E Emadzadeh
H Kilicoglu
H Liu
H Poon
J Björne
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
KB Cohen
L Hirschman
M Miwa
N Chinchor
N Nguyen
Ngan Nguyen
NL Nguyen
Q Le Minh
QC Bui
S Riedel
Toshihisa Takagi
Y Kim
Yue Wang
Publication venue: BioMed Central

Crossref

PubMed Central